- 
                Notifications
    You must be signed in to change notification settings 
- Fork 641
[Fix] kubectl ray create cluster config file CPU overwrites the whole resource requests and limits #3811
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. You can find if there are any helper libraries or tools to handle object merging more elegantly as follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t have enough time to review this PR carefully, but we should be very cautious when implementing this kind of complex merge logic. It’s hard to maintain and prone to errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should find a way to better support the merge logic. In addition, this is not a release blocker.
| Hi @kevin85421 , I have refactored the merge logic. PTAL | 
| config.Namespace = overrideConfig.Namespace | ||
| config.Name = overrideConfig.Name | ||
| config.ServiceAccount = overrideConfig.ServiceAccount | ||
| config.GKE = overrideConfig.GKE | ||
| config.Autoscaler = overrideConfig.Autoscaler | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not in default config, so I assign the value directly
| assert.Equal(t, expected, result) | ||
| } | ||
|  | ||
| func TestMergeWithDefaults(t *testing.T) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid adding tests for private methods. Instead, test public methods if possible. In this case, write tests for ParseConfigFile.
| @kevin85421, I have removed the private function testing and move them to  | 
| return config, nil | ||
| } | ||
|  | ||
| func mergeWithDefaultConfig(overrideConfig *RayClusterConfig) (*RayClusterConfig, error) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this function be simplified a bit ? It seems not need to handle some merges manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @fscnick thanks for reviewing
I initially considered using mergo.Merge(config, &overrideConfig, mergo.WithOverride), but it introduces an unintended side effect when merging slice fields like worker-groups.
For example, consider the following minimal config.yaml:
worker-groups:
  - cpu: 3If we directly call:
config := newRayClusterConfigWithDefaults()
err := mergo.Merge(config, &overrideConfig, mergo.WithOverride)Then overrideConfig.WorkerGroups will completely overwrite the default WorkerGroups defined in config, instead of preserving and partially overriding fields (e.g., memory limits or container settings). This leads to missing default values in the final rendered output. As a result, the merged config will produce:
resources:
  limits: {}
  requests:
    cpu: "3"But the expected output should retain default memory settings like:
resources:
  limits:
    memory: 4Gi
  requests:
    cpu: "3"
    memory: 4GiSo, I used a more sophisticated approach to handle this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we omit mergo.WithOverride and simply call:
config := newRayClusterConfigWithDefaults()
err := mergo.Merge(config, &overrideConfig)Then the fields in overrideConfig will only be merged if they are not already set in config. As a result, the WorkerGroups defined in config.yaml will be ignored entirely if the default already includes a value. This leads to the user-provided override being silently ignored.
For example, the output might look like this:
containers:
- image: rayproject/ray:2.46.0
  name: ray-worker
  resources:
    limits:
      memory: 4Gi
    requests:
      cpu: "2"
      memory: 4GiHere, the cpu: 3 specified by the user in config.yaml is not reflected in the final output — the default cpu: 2 remains.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are three thoughts about handling this slice.
- detach the WorkerGroups, merge it manually and attach back to config.
// detach WorkerGroups
overrideConfigWG := overrideConfig.WorkerGroups
overrideConfig.WorkerGroups = nil
// merge config and overrideConfig
...
// merge WorkerGroups
...
// put back
config.WorkerGroups = mergedWorkerGroups- use Transformerin mergo to skipWorkerGroupsand merge it manually.
- use Transformerin mergo to merge it directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea !! thx
| Hi @fscnick ,I have updated the code and test manually, PTAL | 
| overrideConfigWG := overrideConfig.WorkerGroups | ||
| overrideConfig.WorkerGroups = nil | ||
|  | ||
| config := newRayClusterConfigWithDefaults() | ||
| err = mergo.Merge(config, &overrideConfig, mergo.WithOverride) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to merge config with defaults: %w", err) | ||
| } | ||
| // merge WorkerGroups and keep the default values for missing fields | ||
| // if overrideConfigWG is not nil, we will merge the worker groups from the config file | ||
| // and keep the default values for missing fields | ||
| if overrideConfigWG != nil { | ||
| for len(config.WorkerGroups) < len(overrideConfigWG) { | ||
| config.WorkerGroups = append(config.WorkerGroups, WorkerGroup{ | ||
| Replicas: util.DefaultWorkerReplicas, | ||
| CPU: ptr.To(util.DefaultWorkerCPU), | ||
| Memory: ptr.To(util.DefaultWorkerMemory), | ||
| }) | ||
| } | ||
| for i, workerGroup := range overrideConfigWG { | ||
| err := mergo.Merge(&config.WorkerGroups[i], workerGroup, mergo.WithOverride) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to merge worker group %d: %w", i, err) | ||
| } | ||
| } | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: wrap the merging code for readability as previously did.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem
| Hi @kevin85421 PTAL | 
| | nameOverride | string | `"kuberay"` | String to partially override release name. | | ||
| | fullnameOverride | string | `""` | String to fully override release name. | | ||
| | imagePullSecrets | list | `[]` | Secrets with credentials to pull images from a private registry | | ||
| | gcsFaultTolerance.enabled | bool | `false` | | | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rebase as this had been fixed in #3890
Signed-off-by: Cheyu Wu <[email protected]>
| After giving it some more thought, I don’t think “merge” is the right behavior. It’s hard to get the “merge” behavior right. For example, if users don’t want to specify a memory resource request or limit, there’s no way to do that here. I won’t accept this PR without a strong reason. | 
| 
 Got it, that makes sense. Thanks for the clarification — I’ll go ahead and close this MR. | 
Why are these changes needed?
Root Cause
This issue occurs because the
dataoverrides the default values in&config, causing the defaults to be lost.How I Solved It
I created a
MergeWithDefaultsfunction to merge theRayClusterConfigfield by field.I know this approach isn't particularly elegant, but I couldn’t find a better way to merge deeply nested structs.
If you have a better suggestion, I'd be happy to hear it.
Manual testing
config.yamlDry Run the command
Run the command
$ kubectl ray create cluster test --file ./config.yamlconfig2.yamlDry run the command
Result:
Related issue number
Closes #3803
Checks